Little is known about novel genetic elements that drove the emergence of anthropoid primates. We exploited the
sequencing of the marmoset genome to identify 23,849 anthropoid-specific constrained (ASC) regions and confirmed their
robust functional signatures. Of the ASC base pairs, 99.7% were noncoding, suggesting that novel anthropoid functional 
elements were overwhelmingly cis-regulatory. ASCs were highly enriched in loci associated with fetal brain development,
motor coordination, neurotransmission, and vision, thus providing a large set of candidate elements for exploring the 
molecular basis of hallmark primate traits. We validated ASC192 as a primate-specific enhancer in proliferative zones of the 
developing brain. Unexpectedly, transposable elements (TEs) contributed to >56% of ASCs, and almost all TE families 
showed functional potential similar to that of nonrepetitive DNA. Three L1PA repeat-derived ASCs displayed coherent 
eye-enhancer function, thus demonstrating that the "gene-battery" model of TE functionalization applies to enhancers in 
vivo. Our study provides fundamental insights into genome evolution and the origins of anthropoid phenotypes and 
supports an elegantly simple new null model of TE exaptation. 



[Supplemental material is available for this article.] 

Most whole-genome studies of human evolution have focused on 
the ~5-7 million years since the human-chimpanzee divergence
(Pollard et al. 2006; Prabhakar et al. 2006a, 2008; Haygood et al. 
2007; McLean et al. 2011). However, many quintessentially hu- 
man traits are merely extensions of earlier evolutionary changes 
that appeared in the ancestors of anthropoid primates (Cartmill 
1974; Williams et al. 2010). From a biomedical perspective, these 
earlier primate-specific changes potentially underlie many of the 
limitations of nonprimate disease models. Primate-specific changes 
are also likely to be far more numerous than human-specific alterations,
since the former accumulated over a longer timespan: ~47 million
years versus ~6 million years (Methods). However, with the
exception of studies that focus on transcribed genes (Enard et al. 
2002; Varki et al. 2008; Tay et al. 2009; Pierron et al. 2012), very little is 
known about the DNA sequences that drove the emergence of an- 
thropoid primates. 

The contribution of transposable elements (TEs) to species 
evolution is a topic of intense interest. The early view was that TEs, 
which constitute at least 48% of the human genome, were essen- 
tially genomic parasites, although they could occasionally con- 
tribute new biological functions purely by chance (Doolittle and 
Sapienza 1980; Orgel and Crick 1980). As knowledge of genomic 
function expanded, many notable examples of TE-derived human 
cis-regulatory elements were identified (Feschotte 2008; Rebollo
et al. 2012), but the overall evolutionary impact of TEs was still 
unclear. Thanks to the advent of whole-genome biochemical as- 
says, we now know that TEs contribute massively to genomic 
transcription factor (TF) binding, chromatin openness, focal his- 
tone modification, and tissue-specific DNA methylation (Johnson 
et al. 2006; Marino-Ramirez and Jordan 2006; Wang et al. 2007; 
Kunarso et al. 2010; Kelley and Rinn 2012; Schmidt et al. 2012; 
Chuong et al. 2013; Jacques et al. 2013; Xie et al. 2013). However, 
such studies are indicative only of biochemical activity, i.e., biochemical
changes at the molecular level. These molecular processes
do not always have an impact on organismal phenotypes or
fitness (Eddy 2012; de Souza et al. 2013; Doolittle 2013; Niu and 
Jiang 2013). In other words, biochemical activity does not neces- 
sarily imply biological function. Since transposon-derived se- 
quences are, by definition, bound by host proteins at some point in 
their life cycle and also actively suppressed by the host genome 
in most cell types, it is only to be expected that they should be 
enriched for biochemical activity even in the absence of any im- 
pact on species traits (Eddy 2012). Thus, it is important that we 
analyze the contribution of TEs to human evolution using maps of 
biological, rather than biochemical, function. 

It has been suggested (de Souza et al. 2013) that primate se- 
quence constraint analysis, also known as "phylogenetic shad- 
owing" (Boffelli et al. 2003), could provide a more accurate view of 
the biological functions of TEs. Another complementary strategy is 
to characterize allele frequency skews in human populations (de 
Souza et al. 2013). The advantage of these approaches is that they 
are based on measures of natural selection, and therefore directly 
indicative of functions that are biologically relevant. However, 
since the requisite primate genome sequences were not initially 
available, multiple studies used mammalian sequence compari- 
sons to identify ancient examples of TE functionalization (Cooper 
et al. 2005; Kamal et al. 2006). Two such studies estimated that 5%- 
10% of mammal-specific functional elements were TE-derived 
(Lowe et al. 2007; Mikkelsen et al. 2007). However, it was acknowl- 
edged that these figures based on ancestral mammalian gain of 
function (>100 Mya) may represent severe underestimates, since 
ancient TEs are poorly annotated. In contrast, recently functional- 
ized TEs are likely to still be recognizable as repetitive elements. It is 
therefore imperative that we assess the impact of TEs through whole- 
genome analysis of sequences that (1) show evidence of natural se- 
lection; and (2) became functional only recently (i.e., in the primate 
lineage). Such a strategy could also be used to examine Britten and
Davidson's "gene battery" hypothesis, which states that multiple 
genes may be coregulated via insertion of near-identical TEs in their 
promoter regions (Britten and Davidson 1969). 

Here, we exploit the sequencing of the marmoset genome 
(The Marmoset Genome Sequencing and Analysis Consortium 
2014) to address the questions described above. Being a New World 
monkey, the marmoset, which diverged ~43 Mya (Hedges et al.
2006), lies on the most distant branch of the anthropoid family 
tree (Fig. 1A). Consequently, it provides sufficient statistical power 
to detect anthropoid-specific constrained sequences at the length 
scale of cis-regulatory modules and large exons (Prabhakar et al.
2006b; Wang et al. 2006). Such sequences are likely to have gained 
new functions that set anthropoids, or perhaps all primates, apart 
from distant mammals. Thus, anthropoid-specific constrained se- 
quence analysis constitutes a straightforward method for revealing 
the molecular drivers, both TE-derived and otherwise, of primate 
evolution genome-wide. 

Results 

Identification and validation of anthropoid-specific 
constrained elements (ASCs) in the human genome 

As in our previous locus-specific analyses of primate sequence 
alignments (Prabhakar et al. 2006b; Wang et al. 2006), we defined 
ASCs as sequences that show strong constraint among anthropoid 
primates but little or no constraint among nonprimate mammals 
(Methods; Boffelli et al. 2003). We used a representative selection 
of the anthropoid primate genomes that have been sequenced to 
draft quality or better: human, orangutan, rhesus macaque, and 
common marmoset. Anthropoid constrained elements (ACs) were 
detected genome-wide in global alignments of multispecies syntenic
regions. At a strict P-value threshold of 1 × 10⁻³, we discovered
268,000 ACs covering 4.2% of the syntenic genome
(FDR<0.1%) (see Methods; Supplemental Note 1). In order to pri- 
oritize unambiguous examples of gain of function, we discarded 
from this set any element that showed even weak evidence of 
nonprimate mammalian constraint. We then used fastDNAmL 
(Olsen et al. 1994) to independently validate the lineage-specificity
of constraint in the remaining elements and thus defined a final 
set of 23,849 ACs as ASCs (Methods). ASCs covered 8.7 Mbp (0.34%) 
of the syntenic genome and had a median length of 276 bp (Supplemental
Figs. S1, S2; Supplemental Table S1).
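
The two-step selection described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Element` record, its field names, and the `weak_cutoff` value are hypothetical, and only the 1 × 10⁻³ anthropoid threshold is taken from the text. The subsequent phylogenetic validation step is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    anthropoid_p: float   # constraint P-value among anthropoids (hypothetical field)
    nonprimate_p: float   # constraint P-value among nonprimate mammals (hypothetical field)


def select_ascs(elements, ac_cutoff=1e-3, weak_cutoff=0.1):
    """Keep elements constrained in anthropoids but not in nonprimate mammals.

    ac_cutoff is the strict threshold quoted in the text; weak_cutoff is an
    assumed stand-in for "even weak evidence" of nonprimate constraint.
    """
    # Step 1: anthropoid constrained elements (ACs).
    acs = [e for e in elements if e.anthropoid_p <= ac_cutoff]
    # Step 2: discard any AC with even weak nonprimate constraint.
    return [e for e in acs if e.nonprimate_p > weak_cutoff]


candidates = [
    Element("el1", 1e-6, 0.60),   # anthropoid-only constraint: kept
    Element("el2", 1e-6, 0.02),   # weak nonprimate constraint: discarded
    Element("el3", 0.05, 0.90),   # fails the AC threshold: discarded
]
ascs = select_ascs(candidates)
print([e.name for e in ascs])
```

The deliberately asymmetric cutoffs mirror the stated goal of prioritizing unambiguous gain of function: the anthropoid test is strict, whereas the nonprimate filter rejects on weak evidence.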

To independently test the effect of ASCs on fitness, we ex- 
amined the distribution of derived-allele frequencies at single- 
nucleotide polymorphisms (SNPs) within the ASC set (Fig. 1C). For 
this analysis, we chose SNPs detected in African populations be- 
cause they exhibit high genetic diversity (The 1000 Genomes 
Project Consortium 2010). It has been noted that genomic scans 
for evolutionary constraint tend to enrich for SNPs at which the 
reference genome carries the ancestral allele (Ward and Kellis 
2013). The constrained element allele frequency spectra shown in 
Figure 1C were calculated in a manner that corrects for this bias 
(Supplemental Note 2). Relative to the average in syntenic regions, 
SNPs within ASCs showed a highly significant shift toward lower 
frequencies (P = 5 × 10⁻²²) (Fig. 1C; Supplemental Note 2), indicating
that mutations in ASCs are more deleterious than mutations
at random genomic locations. Thus, ASCs as a group show
evidence of natural selection, and therefore biological function, in 
humans. In order to also test for biochemical functionality, we 
intersected ASCs with DNase I hypersensitive sites from 84 human
cell lines (Thurman et al. 2012). We found twofold enrichment of
hypersensitive base pairs at ASCs, which was even slightly greater 
than the enrichment observed for a widely used whole-genome set 
of placental-constrained elements (Fig. 1D; Supplemental Note 3;
Siepel et al. 2005). These results suggest that, with regard to regu- 
latory and biological function, ASCs are equivalent to constrained 
elements identified by other means. 

We compared our ASCs to other sets of functional elements 
(Fig. 1C): ACs, placental-constrained sequences, coding exons, and 
also primate-specific DNase I hypersensitive sites (Supplemental 
Note 4; Jacques et al. 2013). All five sequence sets showed statis- 
tically significant allele frequency shifts (Supplemental Table S2) — 
clearly, they were all influenced by natural selection. To quantify 
the per-base-pair impact ("effect size") of natural selection, we 
calculated the contribution of low-frequency derived alleles to 
total heterozygosity (Supplemental Note 2). ASCs showed a sub- 
stantial frequency shift: 36.20% of heterozygosity derived from 
SNPs with derived allele frequency <15% versus 33.55% genome- 
wide (Fig. 1C). In contrast to this excess of 2.65 percentage points 
in ASCs, primate-specific hypersensitive sites showed only mini- 
mal allele frequency suppression (0.14 percentage points) (Fig. 1C). 
Thus, lineage-specific constrained elements are more relevant to 
organismal fitness than lineage-specific biochemically active regions. 
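
As a rough illustration of the effect-size measure, the share of heterozygosity contributed by low-frequency derived alleles can be sketched as follows. This assumes the standard per-SNP expected heterozygosity 2p(1 - p); the exact calculation and the bias correction are described in Supplemental Note 2 and are not reproduced here. The function name and the toy frequencies are hypothetical.

```python
def low_freq_het_share(derived_freqs, cutoff=0.15):
    """Fraction of total heterozygosity from SNPs with derived allele frequency < cutoff."""
    # Expected heterozygosity of a biallelic site with derived frequency p (assumed measure).
    het = [2.0 * p * (1.0 - p) for p in derived_freqs]
    total = sum(het)
    if total == 0:
        raise ValueError("no segregating sites")
    low = sum(h for p, h in zip(derived_freqs, het) if p < cutoff)
    return low / total


# Toy derived-allele frequencies (hypothetical, for illustration only).
toy_share = low_freq_het_share([0.05, 0.10, 0.50, 0.90])

# Excess reported in the text: ASCs versus the genome-wide value, in percentage points.
excess_points = round(36.20 - 33.55, 2)
print(f"toy share = {toy_share:.3f}, reported excess = {excess_points} points")
```

Weighting by heterozygosity rather than counting SNPs gives each site influence in proportion to its contribution to standing variation.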

Interestingly, ASCs were enriched only in distal nonexonic 
regions (P = 0) relative to the average AC (Fig. 1E; Supplemental
Note 5). Only 1.2% of ASC bases overlapped noncoding RNAs, and 
no enrichment was observed in this category relative to ACs. Un- 
translated regions (UTRs) were depleted by a factor of 5 (P ~ 0), 
suggesting limited contribution of proximal regulatory elements 
to anthropoid innovation. Most notably, although 13.1% of AC 
base pairs overlapped coding regions, only 0.28% of ASCs were 
protein-coding — a 47-fold reduction (P ~ 0). This bias against gain 
of functional coding sequence is stronger than previously believed 
(Mikkelsen et al. 2007). It is possible that the earlier estimate 
(14-fold) was biased by the fact that it only included noncoding 
regions showing pan-eutherian constraint. Noncoding functional 
elements, which tend to evolve more rapidly, would therefore be 
undercounted (Meader et al. 2010). Since 99.7% of ASC base pairs 
are noncoding and only 1.2% overlap noncoding RNAs, it is evident
that functional blocks gained in ancestral primates were overwhelmingly
cis-regulatory (King and Wilson 1975; Carroll 2003).
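
The fold reduction and the noncoding percentage quoted above follow directly from the two coding fractions. A short check using only figures from the text (variable names are illustrative):

```python
ac_coding_pct = 13.1    # percent of AC base pairs overlapping coding regions
asc_coding_pct = 0.28   # percent of ASC base pairs that are protein-coding

# Depletion of coding sequence in ASCs relative to ACs.
fold_reduction = ac_coding_pct / asc_coding_pct
# Complementary noncoding share of ASC base pairs.
asc_noncoding_pct = 100.0 - asc_coding_pct

print(f"{fold_reduction:.1f}-fold reduction, {asc_noncoding_pct:.1f}% noncoding")
```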

Overarching functional themes among nonexonic ASCs 

In order to examine the in vivo regulatory functions of recently 
gained elements during human development, we intersected ACs 
and ASCs with tissue-specific DNase I hypersensitive sites from 
eight human fetal tissues (Bernstein et al. 2010). Notably, ASCs 
were significantly more likely than ACs to overlap hypersensitive 
sites in human fetal brain and fetal thymus (Fig. 2A; Supplemental 
Note 6). These enriched ASCs thus provide a promising set of 
candidate regulatory regions for exploring the genetic under- 
pinnings of the profound alterations in primate brain devel- 
opment seen in comparative and fossil studies (Kaas 2006) and also 
primate-specific immune adaptations. 

In order to identify functionally coherent sets of molecular 
changes in anthropoid evolution on a larger scale, we used the 
GREAT tool (McLean et al. 2010) to test for enriched annotations in 
the flanking genes of nonexonic ASCs relative to ACs (Methods). 
GREAT avoids systematic biases by explicitly controlling for 
the larger size of genomic loci associated with certain biological 
functions (for example, neurodevelopment). We filtered enriched 
functional terms for "jackpot" effects arising from an excess of
ASCs near a single gene (Methods). We also avoided cherry-picking
of GREAT results by focusing on the top three enriched terms in
each of six functional ontologies (Fig. 2B; Supplemental Table S3).
Remarkably, almost all the top enriched annotations matched
traits known to have evolved uniquely among primates.

Figure 1. Functionality and genomic distribution of ASCs. (A) Phylogenetic tree: anthropoids (blue) and nonprimate mammals (red). (B) Illustrative
example: Sequence alignment of ASC19060 shows constraint in anthropoids and unconstrained divergence among nonprimate mammals. Dots represent
identical nucleotides. (C) Derived allele frequency spectra of African (AFR) SNPs from the 1000 Genomes Project. SNPs within ASCs are shifted to lower
frequencies (<15%), indicating ongoing negative selection in humans. In contrast, SNPs within biochemically defined primate-specific DNase I
hypersensitive (HS) sites (Jacques et al. 2013) show only a weak frequency shift. (D) ASCs are enriched for DNase I hypersensitivity in 84 human cell lines. (E)
Distribution of constrained elements in the human genome. (Green subplot) Of ASC base pairs, 0.3% are protein-coding relative to 13.1% for ACs.

We found massive enrichment of nonexonic ASCs near 
Krüppel-associated box (KRAB) genes (FDR Q-value = 6 × 10⁻⁶⁵)
and zinc finger genes (Q = 5 × 10⁻¹⁷). Both signals derive from
ASCs flanking KRAB-zinc finger (KZNF) genes (Fig. 2B; Supple- 
mental Table S3), which direct repressive chromatin to retroviral 
sequences and other genomic loci (Rowe et al. 2010). Within the 
InterPro ontology, we also found a strong excess of ASCs near substrate
transporter genes (Q = 1.4 × 10⁻¹¹). Notably, when these
genes were sorted by their individual enrichment for ASCs, seven of 
the top 14 were either annotated as urate transporters, or adjacent to 
GWAS SNPs for serum urate level or both (Supplemental Tables S3, 
S4; Anzai and Endou 2011). Thus, it is possible that, in addition to 
the known changes in urate homeostasis among apes (Oda et al.
2002), urate levels also evolved through cis-regulatory gain of function
in the lineage leading to anthropoids.

Nonexonic ASCs also showed strong enrichment near transmembrane
protein precursors (Q = 1 × 10⁻¹⁹), all of which derived
from a single locus containing TMEM132B, TMEM132C, and 
TMEM132D. These genes show robust genetic associations with 
anxiety, depression, attempted suicide, and panic disorder, and 
also differential expression in fear-related brain regions (Hindorff 
et al. 2009; Erhardt et al. 2011). Interestingly, 53/123 ASCs in this 
locus overlap open-chromatin regions detected by DNase-seq in 
human fetal brain, adult brain, or neuronal cell lines (Fig. 3A; 
Supplemental Note 7), suggesting a potential role for ASCs in the 
evolution of these behavioral traits. 

Gamma-aminobutyric acid A (GABA-A) neurotransmitter receptors
were among the top three ASC-enriched gene classes
within both the InterPro and TreeFam ontologies. Moreover, GABA 
synthesis genes were also enriched for ASCs (PANTHER Pathway 
ontology) (Supplemental Table S3). The excess of ASCs near genes 
expressed in the developing eye lens was also almost entirely attributable
to GABA-A receptor genes and other GABA-pathway
genes. We also observed ASC enrichment near histaminergic, serotonergic,
adrenergic, and dopaminergic receptors (TreeFam:
TF316350) (Supplemental Table S3). Moreover, ASCs were enriched
near heterotrimeric G-protein signaling pathway genes (PANTHER
Pathway ontology) (Fig. 2B), which include numerous neurotransmitter
receptors and their downstream effectors. Thus, neurotransmission
genes in general, and GABA-A pathway genes in particular,
were preferentially targeted for anthropoid-specific gain of nonexonic
function.

Figure 3. ASC-enriched genes are associated with disease. Three genomic loci that have accumulated a significant excess of ASCs and also harbor
disease-associated SNPs: (A) TMEM132; (B) OCA2; (C) POLN. Numerous ASCs overlap open chromatin regions (vertical red lines). ASC enrichment is
computed from a moving window of 10 background ACs. (D) Fourteen of the top 20 genes enriched for flanking nonexonic ASCs are associated with
human disease, of which 11 relate to the brain or eye (red).

Intriguingly, ASCs were strongly enriched near genes related 
to abnormal melanogenesis. This was initially a surprising result, 
since anthropoid primates are not known to share any unique 
external coloration phenotypes. However, upon closer inspection, 
we noticed that mutations in four of the five ASC-flanked mela- 
nogenesis genes (OCA2, LYST, HPS6, and BLOC1S4) result in de- 
fective vision and ocular malformations (Oetting and King 1999), 
suggesting a potential link between these ASCs and primate visual 
acuity (Martinez-Morales et al. 2004; Nickla and Wallman 2010). 
We also found ASC enrichment near human genes associated with 
clumsiness, postnatal microcephaly, and progressive gait ataxia. 
Notably, three genes were shared across these disease annotations 
and accounted for most of the enrichment: UBE3A, MECP2, and 
CDKL5 (Supplemental Table S3). ASCs near these genes constitute
promising candidates for exploring the molecular basis of primate- 
specific motor traits and also primate-specific aspects of human 
disease (Yasui et al. 2007). 

Encouraged by the preceding examples, we hypothesized that 
top ASC-enriched genes could reveal additional candidates for 
primate-specific disease biology. The most strongly ASC-enriched 
gene in the genome was the above-described anxiety-associated 
gene TMEM132B, followed by the oculocutaneous albinism II gene 
OCA2 (Fig. 3A,B; Supplemental Note 7). Overall, 14 of the top 
20 ASC-enriched genes were associated with human diseases re- 
lated to behavior, mood, motor coordination, vision, and hearing 
(Fig. 3D). Moreover, 302 ASCs contained SNPs associated with 
human diseases (Fig. 3C; Supplemental Table S5; Supplemental 
Note 8). Thus, we see numerous loci in the genome where primate- 
specific gene regulation may have altered the biology of specific 
human diseases. 

ASC192 is a primate-specific neurodevelopmental enhancer 

It is believed that developmental evolution occurs to a large extent 
through alterations in gene expression (Carroll 2003). However, 
experimentally validated examples of this phenomenon are scarce, 
particularly in the context of primate evolution. We prioritized 
ASC192 (constraint P = 5.8 × 10⁻¹¹) (Supplemental Fig. S3) based
on chromatin profiling data (Methods) and tested this element for
tissue-specific enhancer function in day 11.5 (E11.5) transgenic
mouse embryos. We found highly reproducible reporter-gene ex- 
pression in the central nervous system and eye, with midbrain 
being the strongest expression domain (Fig. 4; Supplemental Fig. 
S4). In stark contrast, the mouse and dog orthologs of ASC192
showed no reproducible activity in this assay (Fig. 4A; Supple- 
mental Figs. S5, S6). Thus, ASC192 represents an evolutionarily 
novel neurodevelopmental enhancer that arose in the ancestral 
lineage of anthropoid primates. 

We examined ASC192 function at greater resolution using
transverse embryonic sections (Fig. 4C). In the diencephalon, 
midbrain, and hindbrain, expression was localized to the alar 
ventricular, subventricular, marginal zones (VZ, SVZ, and MZ), and
the entire roof plate neuroepithelium. Notably, the VZ and SVZ are
sites of neurogenesis in the developing brain. ASC192 was also 
functional in the neural retina and dorsal spinal cord, which are 
again involved in neurogenesis. In order to uncover the target gene 
of ASC192, we used the 3C assay, which probes for physical in- 
teractions between the enhancer and flanking regions (Hagege 
et al. 2007). Human brain tissue is inaccessible at this devel- 
opmental stage. However, human embryonic stem cells (hESCs) 
provide a convenient alternative because they express all three 
genes proximal to ASC192 (Supplemental Fig. S7). Furthermore,
ASC192 is marked by open chromatin in hESCs (Supplemental Fig.
S8). In the 3C assay, ASC192 showed strong long-range interactions
with the promoters of both POU2F1 (also known as OCT1)
isoforms (Fig. 5). Thus, ASC192 functions as an enhancer of the
neurogenesis-related POU2F1 transcription factor (Kiyota et al. 
2008; Theodorou et al. 2009) and drives strong primate-specific 
expression in neurogenic zones during embryonic development. 

Massive contribution of TEs to new functional elements 

In order to investigate the origins of new anthropoid functional 
elements, we intersected ASCs with annotated human TEs. Although
other estimates are even higher (de Koning et al. 2011), the
RepeatMasker track on the UCSC Genome Browser annotates 48%
of the syntenic human genome (hg19) as TE-derived (Supplemental
Note 9). Unexpectedly, 46% of ASC base pairs were TE-derived,
indicating that the earlier estimates of 5.5% and 10% were
biased by incomplete annotation of ancestral TEs (Fig. 6A; Lowe 
et al. 2007; Mikkelsen et al. 2007). As many as 56% of the ASC 
elements overlapped TEs by at least 50 bp. Note that alignment 
artifacts are unlikely to contribute significantly to our count of 
TE-derived ASCs (TE-ASCs) because we detected anthropoid se- 
quence constraint exclusively in quality-filtered multisyntenic 
global alignments (Methods). Moreover, manual examination of 
>100 randomly chosen ASCs failed to identify any such artifacts. 
Nevertheless, in order to independently confirm the functionality 
of TE-ASCs, we specifically examined their bias-corrected allele 
frequency distribution (Fig. 6B). Reassuringly, SNPs within TE- 
derived ASC subregions showed approximately the same effect on 
human fitness (enrichment for low-frequency alleles) as SNPs 
within ASCs as a whole. These results indicate that TE-ASCs have 
massively influenced anthropoid evolution, contributing 3.99 Mbp 
of newly constrained sequence distributed across 14,546 ASCs ge- 
nome-wide. Eighty-five percent of ASC base pairs are ancestral, i.e., 
shared with nonprimate mammals (Supplemental Fig. S9), and 40% 
of these are TE-derived. The remaining 15% of ASC base pairs have 
no ortholog beyond primates; 80% of these are TE-derived. 
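
The ancestral and primate-only breakdown is consistent with the overall 46% figure, as a weighted average shows. The check below uses only numbers quoted in the text; the variable names are illustrative.

```python
ancestral_share = 0.85      # ASC base pairs shared with nonprimate mammals
ancestral_te = 0.40         # TE-derived fraction of the ancestral base pairs
primate_only_share = 0.15   # ASC base pairs with no ortholog beyond primates
primate_only_te = 0.80      # TE-derived fraction of the primate-only base pairs

# Overall TE-derived fraction as a weighted average of the two classes.
overall_te = ancestral_share * ancestral_te + primate_only_share * primate_only_te

# Independent check from the reported sequence totals (Mbp).
te_asc_mbp = 3.99
all_asc_mbp = 8.7
ratio_from_totals = te_asc_mbp / all_asc_mbp

print(f"weighted average = {overall_te:.2f}, from totals = {ratio_from_totals:.2f}")
```

Both routes give approximately 0.46, matching the reported share of TE-derived ASC base pairs.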

Previous studies have highlighted specific ancient TEs, such as 
MER121 and LF-SINE, that show enrichment for pan-mammalian 
constraint (Bejerano et al. 2006; Kamal et al. 2006). However, these 
TEs were inserted over 180 Mya, and therefore their high level of 
sequence constraint may merely reflect the fact that nonconstrained 
instances are no longer detectable. In order to systematically exam- 
ine TE functional enrichment, we tested all RepeatMasker-annotated 
TE families for enrichment in AC base pairs relative to their over- 
all prevalence in the genome (Fig. 6C, blue bars; Supplemental 
Note 9). As expected, the older repeat families frequently showed 
strong overrepresentation in ACs, although this effect was only ob- 
served in small, ancient families with relatively few annotated 
genomic instances, such as "SINE?," "LTR?," and "Deu." The "DNA" 
TE family was the most recent family showing strong enrich- 
ment in ACs. However, even this family predates the mammalian 
radiation; orthologs are detectable even in the highly diverged
platypus genome. The average divergence of "DNA" repeats from
their consensus sequence was calculated to be relatively low (0.27
subs/site) only because half the genomic positions assigned to
this repeat family were evolutionarily constrained. Overall, the
pattern of TE enrichment within ACs is consistent with a model
in which all TE families possess approximately the same functional
potential. As previously suggested (Lowe et al. 2007;
Mikkelsen et al. 2007), the ancient repeat families that do show
functional enrichment most likely do so because their nonconstrained
instances have decayed so much that they are no longer
recognizable as TEs by RepeatMasker.

Figure 4. Primate-specific enhancer function of ASC192. (A) ASC192 drives consistent primate-specific lacZ expression in the developing central nervous
system of E11.5 mouse embryos; four representative embryos are shown for each construct: (fb) forebrain; (mb) midbrain; (hb) hindbrain; (sc) spinal cord.
For the entire set of transgenic embryos, see Supplemental Figures S4-S6. (B) Enhancer success rates for ASC192 and its orthologs: ASC192 drove strong
lacZ expression in 10/14 transgenic embryos, whereas the mouse and dog orthologs of ASC192 drove strong reporter gene expression in only 1/8
transgenics. (C) Transverse sections of a representative ASC192 embryo; strong expression is visible in forebrain, midbrain, and hindbrain (i, ii). lacZ
expression coincides with neuroepithelial zones that spawn neural progenitors (iii, v) and also with neural retina (iv). In the spinal cord, lateral to the roof
and floor plates, enhancer activity localizes to regions containing dorsal spinal interneurons and motor neuron progenitors (v).

Figure 5. 3C assay demonstrates looping between ASC192 and POU2F1. (A) Black ticks indicate locations of primers designed to capture long-range
DNA interactions with ASC192. Locus-wide relative interaction frequencies reveal that ASC192 interacts most strongly with the two promoters of POU2F1.
Error bars represent one standard deviation. (B) DNase-seq signal at ASC192 and the POU2F1 promoter regions indicates an open chromatin conformation
in human fetal brain.



Due to their relatively recent origin, ASCs provide a unique 
opportunity to accurately examine the propensity of TEs from 
various families to become functional. We therefore assessed each 
TE family for enrichment in ASCs (Fig. 6C, red bars). In contrast to 
the ~12-fold maximum enrichment in ACs, we found that only
one TE family (piggyBac) showed greater than threefold enrichment
in ASCs, and 88% (36/41) showed less than twofold enrichment.
Similarly, although 44% (18/41) of TE families showed
greater than twofold depletion in ACs, only 17% (7/41) showed 
such a depletion in ASCs. Moreover, three of these seven were of 
very recent origin ("Other," ERVK, and Alu), and therefore either 
largely or entirely too "young" to contribute functional elements 
to the anthropoid ancestor. Overall, these results suggest that (1)
most repeat families are similar in their propensity for contributing
new functional elements to the genome; and (2) this propensity is
similar to that of unique DNA.

Figure 6. Massive contribution of transposable elements to ASCs. (A) Percent of human genome annotated as TE-derived compared to percent of
lineage-specific constrained base pairs derived from TEs. TE contributions to eutherian- and mammal-specific constrained elements were estimated in
previous studies (Lowe et al. 2007; Mikkelsen et al. 2007). (B) Derived allele frequency spectrum of AFR SNPs within TE-derived ASC subregions shows
a similar enrichment at low frequencies, indicating that TE-ASCs are not noticeably enriched for false positives. (C) Enrichment was defined as the fraction
of constrained (ASC or AC) base pairs attributable to a TE family divided by the fraction of the aligned genome attributable to the same family. TE family
names were assigned by RepeatMasker. Family size was defined as the total size in base pairs of all TEs within the family.
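
The per-family enrichment measure defined for Figure 6C can be written out as a short sketch. The function name, the family labels, and the base-pair counts below are hypothetical toy values, not data from the study; only the ratio itself follows the stated definition.

```python
def family_enrichment(constrained_bp_by_family, total_constrained_bp,
                      genome_bp_by_family, total_aligned_bp):
    """Enrichment = (family share of constrained bp) / (family share of aligned genome)."""
    enrichment = {}
    for family, genome_bp in genome_bp_by_family.items():
        if genome_bp == 0:
            continue  # family absent from the aligned genome: ratio undefined
        constrained_share = constrained_bp_by_family.get(family, 0) / total_constrained_bp
        genome_share = genome_bp / total_aligned_bp
        enrichment[family] = constrained_share / genome_share
    return enrichment


# Toy base-pair counts (hypothetical).
toy = family_enrichment(
    constrained_bp_by_family={"famA": 100, "famB": 10},
    total_constrained_bp=1_000,
    genome_bp_by_family={"famA": 5_000, "famB": 4_000, "famC": 0},
    total_aligned_bp=100_000,
)
print(toy)
```

Values above 1 indicate that a family contributes more constrained sequence than expected from its genomic footprint, and values below 1 indicate depletion.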

Gene battery model of TE exaptation applies to enhancers 
in vivo 

We sought to determine if ASCs derived from homologous TEs 
could act as functionally homologous enhancers in vivo as predicted
by the gene battery hypothesis (Britten and Davidson 1969).
Given the neurodevelopmental themes arising from functional 
enrichment analysis of ASCs, we prioritized five ASCs derived from 
three closely related primate-specific subfamilies of the L1 repeat
family (L1PA13, L1PA15, L1PA16) (Khan et al. 2006), all of which 
showed strong DNase I hypersensitivity in human fetal brain 
(Methods). The mouse developmental stage equivalent to that of 
the human brain samples is not conducive to the lacZ enhancer 
assay. We therefore tested the five TE-ASCs for enhancer activity at 
an earlier developmental stage (E14.5 mouse embryos) (Fig. 7A;
Supplemental Figs. S10-S14). Although the TE-ASCs drove lacZ
staining in multiple brain regions, these expression domains were
highly variable across the embryos. Interestingly, however, three of
five TE-ASCs drove reproducible lacZ expression in the eye. The
other two TE-ASCs also drove eye expression in individual embryos
but not consistently enough to be scored as positives. In contrast,
embryos transgenic for the empty lacZ vector (negative control)
showed no eye staining (Supplemental Fig. S15). Upon further
examination, we noticed that all five tested TE-ASCs flanked genes
known to be specifically expressed or up-regulated in the eye
(Supplemental Table S6).

Figure 7. Shared enhancer function among three ASCs derived from primate-specific L1PA repeats. (A) Three ASC elements drove consistent lacZ
expression in the developing eye at E14.5. Three representative embryos are shown for each construct; the full sets can be found in Supplemental Figures
S10-S12. (B) De novo motif discovery in the tested TE-ASCs uncovered binding motifs similar to the known motifs of TCF3 and SIX6. In situ hybridization,
sagittal sections: Tcf3 and Six6 show strongest expression in the eye at E14.5.

In order to infer the identities of TFs that may coherently 
regulate the L1PA-derived ASCs, we performed unbiased motif
discovery in the L1PA subregions of the five ASCs using MEME 
(Bailey et al. 2009). The top-scoring motif thus identified matched 
the sequence specificity of TCF3 (Fig. 7B). The second- and third- 
ranked motifs had no database matches (data not shown). How- 
ever, the fourth-ranked motif matched the Drosophila TFs Optix 
and sine oculis, and also mammalian SIX6, all of which play crucial 
roles in eye development (Liu et al. 2006; Bharti et al. 2012). We 
therefore examined the expression patterns of Tcf3 and Six6 using 
in situ hybridization at E14.5. Notably, both TFs showed highly 
specific expression in the eye at this time point (Fig. 7B). It is 
possible that the shared presence of binding sites for these TFs 
contributed to the coherent eye-specific expression driven by the 
TE-ASCs. 

In order to explore the gene battery model at the subfamily 
level, we tested 429 TE-ASC subfamilies for functional coherence of 
their neighboring genes (Supplemental Table S7; Supplemental 
Note 10). Note that this analysis has limited statistical power, since 
ASCs represent only a small subset of newly functionalized se- 
quences, and also because TE subfamilies are narrowly defined to 
only include very highly homologous transposon relics. Perhaps 
for this reason, only one TE-ASC subfamily (L1MB5) showed sig- 
nificant functional enrichment relative to the set of all ASCs, after 
correcting for multiple testing. L1MB5-derived ASCs were enriched 
near genes containing the Mib-herc2 domain, which mediates 
Notch signaling. TE-ACs are greater in number than TE-ASCs and 
therefore provide greater statistical power for functional enrich- 
ment analysis. Consequently, we found a larger number of sig- 
nificantly coherent TE-AC subfamilies (33/617 tested) (Supple- 
mental Fig. S16; Supplemental Table S7). In particular, MER121, 
which has previously been noted for strong mammalian evolu- 
tionary constraint (Kamal et al. 2006), was highly enriched near 
reproductive system development and appendage morphogenesis 
genes. Other TE-AC subfamilies were enriched near genes related 
to adaptive immunity (MIRc), susceptibility to seizures (L2b), di- 
verse brain development functions (UCON28b), and synaptic 
plasticity (MIRb). Note that functional biases in ancestral mam- 
malian exaptation are not the only possible explanation for these 
results. It is conceivable, for example, that MER121 repeats actually 
contributed to a broad range of ancestral functions, but the mor- 
phogenetic subset showed enrichment in our analysis because 
morphogenesis gene regulatory elements are more "durable" over 
evolutionary time. 

Discussion 

Molecular origins of anthropoid primates 

We have exploited the availability of the marmoset genome sequence 
to perform the first genome-wide screen for anthropoid-specific 
functional elements. Using a stringent set of filters, we identified 
a large set of ~24,000 ASCs that covers ~9 Mbp of the human genome 
and shows strong genetic and chromatin-state signatures of 
functionality. The overwhelmingly nonexonic (97.4%) 
and noncoding (99.7%) nature of ASCs suggests that, when mea- 
sured by gain of constraint, anthropoid novelty is overwhelmingly 
attributable to gene regulatory changes (King and Wilson 1975). 

The genomic distribution of ASCs was highly nonrandom, 
suggesting strong selection for gain of specific functions in the 
lineage leading to anthropoid primates. The massive excess of 
ASCs near KZNFs suggests that, in addition to coding sequence 
evolution (Huntley et al. 2006), novel gene regulation may have 
played a major role in protecting the ancestral primate genome 
from retroviral transcription (Thomas and Schneider 2011). The 
TMEM132 locus showed the strongest ASC enrichment of all loci in 
the genome, perhaps reflecting altered regulation of TMEM132 
genes during primate evolution in response to selection for fear- 
related behavioral traits. These ASCs are promising candidates for 
exploring the molecular basis of the known differences in anxiety- 
related amygdala-prefrontal cortex circuitry between primates and 
rodents (Kalin et al. 2001). 

The broad-based enrichment of ASCs near GABA and other 
neurotransmitter pathway genes provides a striking molecular 
correlate to one of the unique features of primate cortical de- 
velopment, namely simultaneous overexpression of multiple neu- 
rotransmitter receptors during the early postnatal growth phase 
(Lidow et al. 1991). GABA genes were also responsible for the en- 
richment of ASCs in eye lens development loci. GABA signaling 
influences lens growth (Schwirtlich et al. 2011), and the primate 
lens is known for its ability to accommodate an exceptionally wide 
range of focal lengths (Borja et al. 2010). Thus, widespread anthro- 
poid gain of function in the GABA signaling pathway could po- 
tentially have played a role in primate visual acuity. 

The melanogenesis genes enriched in ASCs also control eye 
development and visual perception, which is significant in light of 
the well-known connections between melanin, eye development, 
and primate evolution (Kirkwood 2009). For example, the choroid 
layer of the primate eye has evolved higher melanin levels in order 
to reduce uncontrolled scattering of light behind the retina (Nickla 
and Wallman 2010). Thus, cis-regulatory gain of function in these 
melanogenesis loci may have contributed to improved visual acuity 
in primates (Martinez-Morales et al. 2004). 
Intriguingly, primates are also known for elevated levels of neu- 
romelanin in dopamine neurons (Marsden 1961). 

Finally, ASC enrichment near motor coordination and mi- 
crocephaly genes (UBE3A, MECP2, CDKL5, and others) suggests 
that gain of new regulatory elements may have contributed to 
primate motor adaptation and enlargement of the primate motor 
cortex (Kaas 2008) and also potentially to primate-specific aspects 
of neurodevelopmental diseases. Overall, ASCs show a strong and 
consistent trend of enrichment in gene loci with clear links to 
known anthropoid-specific phenotypes. Thus, they provide us with 
the first large set of candidate genomic elements for exploring the 
cis-regulatory underpinnings of hallmark primate traits. 

We have validated ASC192 as the first known neurodevelopmental 
enhancer specific to primates. ASC192 drives expression in 
neurogenic zones of the developing brain. Interestingly, 
POU2F1, the target gene of ASC192, is a driver of developmental 
neurogenesis (Theodorou et al. 2009) and a vital effector of radial 
glia formation in the VZ and SVZ (Kiyota et al. 2008). These links 
between POU2F1 and ASC192 function suggest a possible role 
for the newly evolved enhancer in primate-specific neuronal 
proliferation. ASC192 may thus have contributed to 
the increased brain size and unique brain structure of anthropoid 
primates (Kriegstein et al. 2006). 

Origins of new functional elements 

Where do new functional elements come from? Many previous 
whole-genome studies have addressed this question using 
biochemical technologies such as ChIP-seq, DNase-seq, and DNA 
methylation profiling (Johnson et al. 2006; Marino-Ramirez and 
Jordan 2006; Wang et al. 2007; Kunarso et al. 2010; Kelley and Rinn 
2012; Schmidt et al. 2012; Chuong et al. 2013; Jacques et al. 2013; 
Xie et al. 2013). However, our results indicate that sequences 
showing lineage-specific biochemical activity have fewer fitness 
consequences than sequences showing lineage-specific constraint 
(Fig. 1C). In other words, they are biochemically functional but less 
enriched for biological function. Other whole-genome studies 
have taken the approach of mapping ancient (mammalian) con- 
strained elements and intersecting them with TEs (Lowe et al. 
2007; Mikkelsen et al. 2007). However, our results indicate that 
recent functionalization provides a cleaner lens through which to 
view the process of evolution, most likely because annotation of 
TEs is more reliable over shorter timescales (Fig. 6A). 

The 23,849 ASCs identified in this study constitute the first 
genome-wide data set that satisfies the two important criteria 
discussed above: They were recently functionalized, and they are 
also important for fitness. Remarkably, TEs contributed to >56% of 
ASCs, and new functional elements arose from repetitive and 
nonrepetitive DNA without any discrimination between the two 
categories (Fig. 6A). Just as remarkably, the 41 annotated repeat 
families all showed approximately the same propensity for con- 
tributing new functions once their prevalence in the genome was 
taken into account (Fig. 6C). Thus, our results support a new, el- 
egantly simple model of molecular evolution in which new 
functions arise more or less at random in the genome, with little 
regard to repeat status or repeat family. This is at variance with the 
conclusions of studies based on biochemical definitions of 
function, which detect massive enrichment of LTR repeats (Wang 
et al. 2007; Thurman et al. 2012; Chuong et al. 2013; Jacques et al. 
2013). The most parsimonious explanation for these divergent 
findings is that there exists in the genome a large number of 
biochemically active LTRs that have little or no biological func- 
tion. Another possible explanation is that the biochemically 
based studies examined function in specific cell types, whereas 
ASCs are cell-type agnostic. However, the enrichment of bio- 
chemically functional LTRs was observed across a very broad 
range of cell types (Jacques et al. 2013), suggesting that the former 
explanation is more likely. 

It has long been hypothesized that the existence of near- 
identical TE insertions within the upstream regions of functionally 
related genes could provide a mechanism for their coexpression 
(Britten and Davidson 1969). This hypothesis has recently found 
in vitro experimental support: A large set of MER20 LTR-derived 
sequences was shown to drive expression in mammalian endo- 
metrial cells (Lynch et al. 2011). In addition, RLTR13D5 insertions 
have been shown to drive reporter gene expression in rodent 
placental cells (Chuong et al. 2013). However, despite considerable 
speculation based on individual instances (Bejerano et al. 2006; 
Tashiro et al. 2011), there has so far been no corresponding evi- 
dence for enhancers driving gene batteries in vivo. We have used in 
vivo enhancer assays of ASCs to show that three highly homologous 
TE-derived sequences (L1PA subfamilies) drive coherent expression 
patterns in the developing eye. Experimental dissection 
of additional TE-ASC subgroups could potentially reveal many 
additional instances of enhancer-driven gene batteries in primate 
evolution. 

Methods 

Duration of primate-specific and human-specific evolution 

As defined in this study, anthropoid-specific functional elements 
arose at some point between the divergence of the nonprimate 
outgroup species (tree shrew) and the divergence of the marmoset 
and human lineages (Fig. 1A). These two divergence times have 
been estimated at ~90 million years ago and ~43 million years 
ago, respectively (Hedges et al. 2006). Consequently, ASCs arose 
over a time span of ~47 million years (90 minus 43). In contrast, 
the evolutionary timespan since the human-chimpanzee 
divergence is much shorter: ~6 million years (Hedges et al. 2006). 



Global alignment of multisyntenic regions 

DNA sequences of human (hg19), three anthropoid primates, 
Pongo pygmaeus abelii (ponAbe2), Macaca mulatta (rheMac2), and 
Callithrix jacchus (calJac3), and three nonprimate mammals, Canis 
familiaris (canFam2), Mus musculus (mm9), and Equus caballus 
(equCab2), were downloaded from the UCSC Genome Browser 
(Kent et al. 2002). Their respective pairwise chain and net alignments 
to human were also downloaded. To identify multispecies syntenic 
blocks, we divided the human genome into intervals of length >50 
kbp that were syntenic across all seven species, as evidenced by the 
existence of "Level-1" net alignments. Level-2 nets >50 kbp were 
used to fill gaps in Level 1, and Level-3 nets similarly filled gaps in 
Level 2. 

In each of the 3193 multisyntenic regions, we discarded 
nonhuman sequences to which too small a fraction of human 
bases was aligned in the "net." Percent alignment thresholds 
were selected empirically for each species based on the inflection 
point in the whole-genome distribution of aligned fractions 
(orangutan: 60%; rhesus macaque: 55%; marmoset: 45%; dog: 
30%; horse: 35%; and mouse: 30%). We discarded syntenic 
blocks if (1) either marmoset or dog was insufficiently aligned; 
(2) both rhesus and orangutan were insufficiently aligned; or 
(3) both mouse and horse were insufficiently aligned, resulting 
in 1876 filtered multisyntenic regions covering 2.57 Gbp of the 
human genome. 
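
The three discard rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys, the function names, and the choice to treat a species as "insufficiently aligned" when its aligned fraction is strictly below the threshold are ours, not taken from the original pipeline. The threshold values are the ones quoted above.

```python
# Minimum aligned fraction per species (values quoted in the text).
THRESHOLDS = {
    "orangutan": 0.60, "rhesus": 0.55, "marmoset": 0.45,
    "dog": 0.30, "horse": 0.35, "mouse": 0.30,
}

def insufficient(aligned, species):
    """True if the species is below its threshold (assumed strict '<')."""
    return aligned.get(species, 0.0) < THRESHOLDS[species]

def keep_block(aligned):
    """Apply discard rules (1)-(3) to one block's aligned fractions."""
    # Rule 1: marmoset and dog are each required.
    if insufficient(aligned, "marmoset") or insufficient(aligned, "dog"):
        return False
    # Rule 2: at least one of rhesus/orangutan must be sufficiently aligned.
    if insufficient(aligned, "rhesus") and insufficient(aligned, "orangutan"):
        return False
    # Rule 3: at least one of mouse/horse must be sufficiently aligned.
    if insufficient(aligned, "mouse") and insufficient(aligned, "horse"):
        return False
    return True
```

Under these rules a block survives with a poorly aligned mouse sequence as long as horse is adequately aligned, but never without marmoset or dog.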

We used the global aligner MLAGAN (Brudno et al. 2003) to 
align sequences within each multisyntenic block (see Supple- 
mental Table S8 for scoring matrices). Sequence positions with 
quality score <30 were masked to "N" in the respective genomes to 
avoid artifacts from sequencing errors. 



Scanning for constraint using Gumby 

Multiply aligned syntenic blocks were scanned for constrained 
segments using Gumby (Prabhakar et al. 2006b). We ran Gumby 
with a strict P-value threshold when scanning for constraint 
in anthropoids (P < 0.001) and a loose threshold when scann- 
ing for constraint among nonprimate mammals (P < 0.1). We 
detected constrained elements in multiple species sets: (1) human- 
marmoset-rhesus-orangutan; (2) human-marmoset-orangutan; (3) 
human-marmoset-rhesus; (4) dog-horse-mouse; (5) dog-horse; and 
(6) dog-mouse. For each set, we ran Gumby with two values of the 
"Ratio" parameter (2 and 5) and merged the resulting sets of con- 
strained regions. 

Additional syntenic block filtering using Gumby alignment 
metrics and GC content 

Gumby fits the distribution of column scores of multiple-sequence 
alignments to a Gumbel distribution with parameters K and λ. We 
discarded syntenic blocks with outlier K or λ values, i.e., values that 
fell below the 25th percentile or above the 75th percentile for the 
chromosome by >1.5 times the interquartile range (IQR). We also filtered syntenic 
blocks with respect to GC-adjusted substitution rate. For each 
nonhuman species and each syntenic block, we linearly regressed 
the pairwise substitution rate with human against the GC content. 
If the regression residual of any nonhuman species within a syn- 
tenic block was atypical for the chromosome (by the same IQR 
criterion), the syntenic block was discarded. These filters yielded 
a final list of 1713 "clean" syntenic blocks covering 2.55 Gbp of the 
human genome. The 268,521 anthropoid-constrained (AC) se- 
quences cover 4.2% of the syntenic genome. 
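
The two outlier filters in this subsection can be illustrated as follows. This is a minimal sketch, not the original implementation: the function names are hypothetical, ordinary least squares is assumed for the regression, and the quartiles come from Python's default `statistics.quantiles` method, which may differ slightly from the percentile definition used in the study.

```python
import statistics

def tukey_outliers(values, k=1.5):
    """Flag values more than k*IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v < q1 - spread or v > q3 + spread for v in values]

def ols_residuals(x, y):
    """Residuals of an ordinary least-squares fit of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return [b - (slope * a + intercept) for a, b in zip(x, y)]
```

For one chromosome and one nonhuman species, `tukey_outliers(ols_residuals(gc, rate))` would mark the syntenic blocks whose GC-adjusted substitution rate is atypical. The same `tukey_outliers` call applied to the per-block Gumbel parameters covers the first filter.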

Scanning for mammal constraint using phastCons on 46-way 
MULTIZ alignment 

In contrast to the ACs, we used an extremely broad definition of 
nonprimate mammalian constraint (2.1 million elements, 11% of 
syntenic genome). Our mammalian constraint screen combined 
two independent algorithms, multiple parameter settings, rela- 
tively loose significance thresholds, and four different species sets, 
one aligned globally and three locally (Supplemental Fig. S1). The 
details are discussed below. 

To increase sensitivity in detecting nonprimate mammal 
constraint, we used phastCons (Siepel et al. 2005) to scan the 
subalignment of 25 nonprimate mammals (Supplemental Table 
S9) within the UCSC 46-species whole-genome MULTIZ alignment 
(Kent et al. 2002; Blanchette et al. 2004). Glire genomes have 
evolved more rapidly than those of other mammals, resulting in 
frequent nonfunctionalization of otherwise constrained sequences. 
Consequently, the inclusion of Glires in the 25-species set occa- 
sionally hinders detection of mammalian constraint. Conversely, 
some functional elements could be specific to Euarchontoglires and 
therefore constrained only in Glires and primates. For greater sen- 
sitivity, we therefore used phastCons to scan only the Glires (seven 
species) and also only the non-Euarchontoglire mammals (18 spe- 
cies). Again, for completeness, we ran phastCons twice for each of 
the three mammal subgroups (25, 18, and seven species), using two 
sets of parameters ("-rho 0.31 -target-coverage 0.3 -explen 45" and 
"-rho 0.5 -target-coverage 0.2 -explen 65"), for a total of six phastCons 
genome-wide constraint scans. We then filtered the phastCons 
constrained elements in order to have comparable genomic cover- 
age with the Gumby nonprimate mammal constrained elements. 
We discarded the rho = 0.5 elements whose phastCons score was less 
than 50 and used all phastCons constrained elements detected us- 
ing rho = 0.31. Overlapping constrained elements from the various 
scans were merged, resulting in a highly sensitive and comprehen- 
sive final set of 2.1M nonprimate mammal constrained elements 
covering 10.96% of the syntenic genome. 
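
The final merging step, in which overlapping elements from the different scans are combined, might look like the sketch below. The half-open `(chrom, start, end)` tuple representation and the function name are assumptions made for illustration. Intervals that merely abut are left separate here, since the text speaks only of overlapping elements.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) half-open intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start < last[2]:
            # Strict overlap with the previous interval: extend it.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Pooling the element lists from all scans and passing them through `merge_intervals` once gives a non-redundant set, from which genomic coverage can be computed by summing `end - start`.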



Definition of anthropoid-specific constrained (ASC) sequences 
and validation using fastDNAmL 

We defined ASCs as anthropoid-constrained (AC) sequences with 
at most 10% of their length covered by mammal-constrained ele- 
ments, resulting in an initial set of 24,999 candidates. We used 
fastDNAmL (Olsen et al. 1994) to compute branch lengths (sub- 
stitution rates) within each candidate ASC and also the back- 
ground substitution rate within nonexonic regions of the same 
syntenic block. For each candidate element, the total substitution 
rate within the anthropoid (or nonprimate mammal) phylogenetic 
tree was calculated by summing over all lineages, and these 
aggregate substitution rates were compared to the corresponding 
background rates. We defined the anthropoid constraint factor (aCF) as 

$$\mathrm{aCF} = \frac{\sum l_{\mathrm{elem\_tree}}}{\sum l_{\mathrm{syn\_tree}}},$$

where l represents the length of a phylogenetic tree branch, and the 
subscripts indicate the phylogenetic tree over which summation is 
performed. We incorporated uncertainty in branch length estimation 
by defining a "modified anthropoid constraint factor" (maCF), which 
constitutes an upper bound on the constraint factor: 

$$\mathrm{maCF} = \frac{\sum l_{\mathrm{elem\_tree}} + \sqrt{\sum \sigma^{2}_{\mathrm{elem\_tree}}}}{\sum l_{\mathrm{syn\_tree}}},$$

where σ is the length uncertainty of an individual tree branch 
(95% confidence interval reported by fastDNAmL). Note that we 
have ignored uncertainties in background substitution rates be- 
cause these are typically small. We similarly computed the non- 
primate mammal constraint factors for each candidate ASC and 
also the constraint factors in the tree-shrew lineage. 

We declared an element constrained according to fastDNAmL 
if it satisfied constraint factor <0.7 and modified constraint factor 
<0.85. All 24,999 elements had anthropoid constraint factors be- 
low these thresholds, suggesting that Gumby constraint analysis 
produced no obvious false positives. By the same criterion, 1150 
candidate ASCs showed constraint either among nonprimate 
mammals or in the tree-shrew lineage. We discarded these 1150 
elements to obtain a final list of 23,849 validated ASCs, ranging in 
size from 77 to 5239 bp, with a median of 276 bp (Supplemental Fig. 
S2; Supplemental Table S1). 
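
The constraint-factor test in this section can be sketched as follows. This assumes that the modified factor adds the root-sum-square of the per-branch length uncertainties to the summed element branch lengths before dividing by the summed background branch lengths. The function and argument names are hypothetical.

```python
import math

def constraint_factors(elem_lengths, elem_sigmas, syn_lengths):
    """Return (constraint factor, modified constraint factor) for one tree.

    elem_lengths: branch lengths estimated within the candidate element
    elem_sigmas:  per-branch length uncertainties for the element
    syn_lengths:  background branch lengths for the syntenic block
    """
    elem_total = sum(elem_lengths)
    syn_total = sum(syn_lengths)
    cf = elem_total / syn_total
    # Upper bound: uncertainties are assumed to combine in quadrature.
    bound = math.sqrt(sum(s * s for s in elem_sigmas))
    return cf, (elem_total + bound) / syn_total

def is_constrained(cf, mcf, cf_max=0.7, mcf_max=0.85):
    """Thresholds quoted in the text: CF < 0.7 and modified CF < 0.85."""
    return cf < cf_max and mcf < mcf_max
```

A candidate would be kept as an ASC when `is_constrained` holds for the anthropoid tree but fails for both the nonprimate mammal tree and the tree-shrew lineage. The element counts in the text are consistent with this: 24,999 candidates minus 1150 discarded elements gives 23,849.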

Functional enrichment analysis of ASCs using GREAT 

We used GREAT (McLean et al. 2010, version 1.7.0) to determine 
whether nonexonic ASCs were enriched near genes belonging to 
specific functional categories. This enrichment analysis was per- 
formed using Fisher's exact test with the entire set of ACs as the 
"background" set. In other words, we were testing for ASC func- 
tional enrichment relative to ACs. We downloaded results for all 
20 ontology databases that can be accessed through GREAT. In 
order to harmonize statistical significance measures across the 
20 ontologies, we chose the conservative approach of recalculating 
the FDR Q-value of each functional category based on the P-values 
of all tests across all ontologies. In most cases, this had the effect 
of mildly increasing the Q-value originally calculated by GREAT, 
thus mildly reducing the estimated statistical significance. We 
used an FDR Q-value threshold of 0.01 and discarded functional 
categories in which ASCs were less than 1.2-fold enriched relative 
to ACs. 
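
The pooled Q-value recalculation can be illustrated with a short sketch. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as is the function name.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg Q-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so Q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

Passing the concatenated P-values of all 20 ontologies to one call makes `m` the total number of tests rather than the per-ontology count. This is why pooling tends to raise each Q-value relative to a per-ontology correction.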

Since the GO annotation is hierarchical, we pruned the GREAT 
results by deleting an enriched functional category if it was a "child" 
of a "parent" (superset) category that was enriched with a superior 
P-value. Conversely, if the parent category added nothing to the 
observed-expected statistic of the child, then the child category was 
retained. The other 17 ontologies were pruned by defining a 
functional category as a child of another category if all of the genes 
within the former were contained within the gene list of the latter. 
AC-gene assignments made by GREAT were used to ask if any in- 
dividual genes were enriched in ASCs, using the same Fisher's exact 
test approach (Supplemental Table S4). 

To avoid "jackpot" effects arising from an abundance of 
ASCs near a single gene, we recomputed the P-value of each ASC- 
enriched functional category after removing the single most 
enriched gene. If the functional category was no longer significant 
at an uncorrected P-value threshold of 0.05, it was removed from 
the list. The complete set of GREAT results (FDR Q-value <0.01 and 
at least three foreground gene hits) are in Supplemental Table S3. 
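
The jackpot check can be sketched with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. Several details here are assumptions: the per-gene count dictionaries, the function names, and the choice to remove the dropped gene's hits from the category while leaving the genome-wide ASC and AC totals unchanged.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n of N elements are drawn and K of the N are hits."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return hits / total

def p_without_gene(fg_by_gene, bg_by_gene, drop_gene, n_fg, n_bg):
    """Category P-value after removing one gene's ASC and AC hits.

    fg_by_gene / bg_by_gene: gene -> number of ASCs / ACs assigned to it,
    restricted to genes in one functional category.
    """
    k = sum(c for g, c in fg_by_gene.items() if g != drop_gene)
    K = sum(c for g, c in bg_by_gene.items() if g != drop_gene)
    return hypergeom_upper_tail(k, n_fg, K, n_bg)
```

A category would be dropped when `p_without_gene(...)` for its most enriched gene is no longer below 0.05. Exact integer arithmetic keeps the sketch simple, but for genome-scale counts a log-space or library implementation would be more practical.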

Transgenic enhancer assay 

We shortlisted ASC candidates within the top 200 by P-value that 
showed DNase I hypersensitivity at all available developmental 
stages in human fetal brain (day 85 to day 142; NHGRI Epigenome 
Atlas) (Bernstein et al. 2010). We then filtered out all ASCs that 
overlapped a human mRNA. Among the resulting candidates, 
ASC192 (constraint P = 5.8 × 10^-11) (Supplemental Fig. S3) was 
chosen because it showed the greatest hypersensitivity at day 85, 
the earliest time point. 

The lacZ reporter constructs for in vivo mouse embryonic en- 
hancer assays were created as previously described (Pennacchio 
et al. 2006). Embryos were collected and stained for lacZ activity at 
embryonic day E11.5/E14.5. In brief, all candidate enhancer se- 
quences were PCR-amplified from human, mouse, and dog geno- 
mic DNA (Roche) using high fidelity Pfx Polymerase (Invitrogen). 
PCR fragments were cloned into the pENTR plasmid (Invitrogen), 
transferred into the Gateway-compatible hsp68-lacZ reporter vector 
using LR recombination, and validated by Sanger sequencing. NotI 
digestion was used to excise the vector backbone from the reporter 
construct. The gel extracted fragment was further diluted and used 
for pronuclear microinjection of mouse zygotes. Pronuclear mi- 
croinjection of the DNA was performed by Cyagen Biosciences 
(USA) using standard procedures. Two rounds of injection were 
performed for each construct. Embryos were collected and stained 
for lacZ activity. The stained embryos were fixed with 4% 
paraformaldehyde overnight at 4°C, washed with phosphate-buffered 
saline, dehydrated, and embedded in paraffin. Ten-micron-thick 
paraffin sections were obtained using a microtome (Leica RM2165). 
The air-dried sections were counterstained with eosin (pink stain) and 
mounted with di-n-butyl phthalate in xylene (DPX, Fisher 
Scientific). Zeiss Axio Imager Z1 and Leica M205FA were used for imaging 
and documenting sectioned and whole embryos, respectively. 

The coordinates of the sequences tested in mouse embryo 
assays are given in Table 1. 

Motif discovery in L1PA-derived ASC bases 

The subregions of ASC21145, ASC7041, ASC12942, ASC20100, and 
ASC12975 that intersect L1PA13,15,16 repeats were scanned for 
motifs using MEME (Bailey et al. 2009) with the following 
parameter settings: "zoops" model of motif occurrence; minw = 6; 
maxw = 14; nmotifs = 20; and -revcomp. The discovered motifs 



Table 1. Coordinates of the sequences tested in mouse embryo assays 

Tested sequence   Genome assembly   Coordinates 
ASC192A           hg19              chr1:167,458,696-167,459,460 
mmASC192A         mm9               chr1:167,744,043-167,744,743 
cfASC192A         canFam2           chr7:33,892,362-3,389,299 
ASC21145          hg19              chr14:39,417,480-39,417,917 
ASC7041           hg19              chr1:69,762,676-69,763,499 
ASC12942          hg19              chr6:73,763,531-73,764,881 
ASC20100          hg19              chr8:35,669,458-35,670,886 
ASC12975          hg19              chrX:121,904,222-121,905,024 



were matched to motifs in the TRANSFAC, UniPROBE, and JASPAR 
databases using TOMTOM (Gupta et al. 2007). 

The top motif discovered by MEME had an E-value of 0.067. 
This motif was matched by TOMTOM to the TCF3 UniPROBE 
motif (P = 0.00037). The fourth MEME motif had an E-value of 6 × 
10^2, which exceeds the conventional E-value threshold of 0.1. We 
nevertheless followed up on this motif since it had a statistically 
significant match to a UniPROBE motif (SIX6; P = 0.001). 

Chromosome conformation capture (3C) 

We performed the 3C assay on ASC192 and neighboring gene 
promoter regions according to a previously described protocol 
(Hagege et al. 2007). Briefly, ~1.5 × 10^7 H1 human embryonic 
stem cells were crosslinked at a final concentration of 1% 
formaldehyde. The cells were lysed and their nuclei isolated. HindIII was 
used as the primary restriction endonuclease to digest the nuclear 
samples, and the regions of chromosome contact were ligated us- 
ing the T4 DNA ligase. Following protein digestion, reverse cross- 
linking, and purification, physical interactions between genomic 
regions were detected by quantitative real-time PCR (qPCR) using 
PCR primers specific for each of the ligated interaction products. 
Primers were designed using Primer 3 and are listed in Supple- 
mental Table S10. To quantitatively compare signal intensities 
obtained from different primer sets, a control template containing 
all possible ligation products in equimolar amounts was used to 
correct for the PCR efficiency of each primer set. For this purpose, 
two BACs (CH17-214G02 and CH17-46K14) with minimal overlap 
spanning the locus of interest were digested and ligated as before. 
The ligated control fragments were diluted to appropriate con- 
centrations to obtain a final working concentration of 50 ng/mL 
total DNA, similar to that of the 3C templates. The control tem- 
plates were also used to draw standard curves for the quantitative 
real-time PCR as described in the protocol. 

Histology and RNA in situ hybridization 

Embryos were fixed, dehydrated in graded ethanol, and stored at 
-20°C. The cDNA clones of Tcf3 (IMAGE clone:2631291) and Six6 
(Riken clone: 4832418K20) were linearized with EcoRI and NaeI, 
respectively, and used as templates for synthesizing antisense 
DIG-labeled Tcf3 and Six6 RNA probes (DIG RNA labeling kit, 
Roche). RNA in situ hybridization was performed essentially as 
previously described (Tribioli et al. 1997) on 10-μm 
paraffin-embedded embryonic sections (Leica RM 2165 microtome). Following 
hybridization and washing, sections were stained with NBT/BCIP 
and exposed overnight at 4°C in the dark according to the man- 
ufacturer's instructions. Sections were washed in PBS, mounted 
with glycerol/gelatin, and subjected to imaging. 